ABSTRACT


Knowledge tracing is the task of modeling each student’s 
mastery of knowledge concepts (KCs) as (s)he engages with 
a sequence of learning activities. Each student’s knowledge 
is modeled by estimating the performance of the student 
on the learning activities. It is an important research area 
for providing a personalized learning platform to students. 
In recent years, methods based on Recurrent Neural Net- 
works (RNN) such as Deep Knowledge Tracing (DKT) and 
Dynamic Key-Value Memory Network (DKVMN) outper- 
formed all the traditional methods because of their ability to 
capture a complex representation of human learning. How- 
ever, these methods face the issue of not generalizing well 
while dealing with sparse data which is the case with real- 
world data as students interact with few KCs. In order to 
address this issue, we develop an approach that identifies 
the KCs from the student’s past activities that are rele- 
vant to the given KC and predicts his/her mastery based 
on the relatively few KCs that it picked. Since predictions 
are made based on relatively few past activities, it handles 
the data sparsity problem better than the methods based 
on RNN. For identifying the relevance between the KCs, 
we propose a self-attention based approach, Self Attentive 
Knowledge Tracing (SAKT). Extensive experimentation on 
a variety of real-world datasets shows that our model outperforms
the state-of-the-art models for knowledge tracing,
improving AUC by 4.43% on average.


Keywords 
Knowledge Tracing, Massive Open Online Courses, Self- 
attention, sequential recommendation 


1. INTRODUCTION 


The availability of massive datasets of students’ learning
trajectories over knowledge concepts (KCs), where a KC can be an
exercise, a skill or a concept, has attracted data miners to develop
tools for predicting students’ performance and giving proper
feedback [8]. For developing such personalized learning platforms,
knowledge tracing (KT) is considered to be an important task and is
defined as the task of tracing a student’s knowledge state, which
represents his/her mastery level of KCs, based on his/her past
learning activities. The KT task can be formalized as a supervised
sequence learning task: given a student’s past exercise interactions
$X = (x_1, x_2, \ldots, x_t)$, predict some aspect of his/her next
interaction $x_{t+1}$. On a question-answering platform, the
interactions are represented as $x_t = (e_t, r_t)$, where $e_t$ is
the exercise that the student attempts at timestamp $t$ and $r_t$ is
the correctness of the student’s answer. KT aims to predict whether
the student will be able to answer the next exercise correctly,
i.e., predict $p(r_{t+1} = 1 \mid e_{t+1}, X)$.


Figure 1: Left subfigure shows the sequence of exercises that the
student attempts and the right subfigure shows the knowledge
concepts to which each of the exercises belong.


Recently, deep learning models such as Deep Knowledge Tracing
(DKT) [6] and its variant [10] used a Recurrent Neural Network (RNN)
to model a student’s knowledge state in one summarized hidden
vector. Dynamic Key-Value Memory Network (DKVMN) [11] exploited a
Memory Augmented Neural Network [7] for KT. Using two matrices, key
and value, it learns the correlation between the exercises and the
underlying KCs, and the student’s knowledge state, respectively.
The DKT model faces the issue of its parameters being
non-interpretable [4]. DKVMN is more interpretable than DKT as it
explicitly maintains a KC representation matrix (key) and a
knowledge state representation matrix (value). However, since all
these deep learning models are based on RNNs, they face the issue
of not generalizing well when dealing with sparse data [3].


In this paper, we propose to use a purely attention-based method,
the Transformer [9]. In the KT task, the skills that a student
builds while going through the sequence of learning activities are
related to each other, and the performance on a particular exercise
depends on his/her performance on the past exercises related to
that exercise. For example, in Figure 1, for a student to solve an
exercise on “Quadratic equation” (exercise 5), which belongs to the
knowledge concept “Equations”, he/she needs to know how to find
“square roots” (exercise 3) and “linear equations” (exercise 4).
SAKT, proposed in this paper, first identifies the relevant KCs
from the past interactions and then predicts the student’s
performance based on his/her performance on those KCs. For
predicting a student’s performance on an exercise, we use exercises
as KCs. As we show later, SAKT assigns weights to the previously
answered exercises while predicting the performance of the student
on a particular exercise. The proposed SAKT method significantly
outperforms the state-of-the-art KT methods, gaining a performance
improvement of 4.43% in AUC on average across all datasets.
Furthermore, the main component of SAKT (self-attention) is
suitable for parallelism, making our model an order of magnitude
faster than RNN-based models.


(a) Network of SAKT. At each timestamp, the attention weights are
estimated for each of the previous elements only. Keys, values and
queries are extracted from the embedding layer shown below. When
the jth element is the query and the ith element is the key, the
attention weight is $a_{i,j}$.

(b) Embedding layer. It embeds the current exercise that the
student is attempting and his/her past interactions. At every
timestamp $t+1$, the current question $e_{t+1}$ is embedded in the
query space using the Exercise embedding, and the elements of past
interactions $x_i$ are embedded in the key and value space using
the Interaction embedding.


Figure 2: Diagram showing the architecture of SAKT.


2. PROPOSED METHOD 


Our model predicts whether a student will be able to answer the
next exercise $e_{t+1}$ based on his/her previous interaction
sequence $X = x_1, x_2, \ldots, x_t$. As shown in Figure 2, we can
transform the problem into a sequential modeling problem. It is
convenient to consider the model with inputs
$x_1, x_2, \ldots, x_{t-1}$ and the exercise sequence one position
ahead, $e_2, e_3, \ldots, e_t$, with the output being the
correctness of the responses to those exercises,
$r_2, r_3, \ldots, r_t$. The interaction tuple $x_t = (e_t, r_t)$ is
presented to the model as a number $y_t = e_t + r_t \times E$, where
$E$ is the total number of exercises. Thus, an element in the
interaction sequence can take $2E$ possible values, while elements
in the exercise sequence can take $E$ possible values.


Table 1: Notations

Notation             Description
$N$                  total number of students
$E$                  total number of exercises
$X$                  interaction sequence of a student: $(x_1, x_2, \ldots, x_t)$
$x_i$                ith exercise-answer pair of a student
$n$                  maximum length of sequence
$d$                  latent vector dimensionality
$e$                  sequence of exercises solved by the student
$\mathbf{M}$         interaction embedding matrix
$\mathbf{P}$         positional embedding matrix
$\mathbf{E}$         exercise lookup matrix
$\hat{\mathbf{M}}$   past interactions embedding
$\hat{\mathbf{E}}$   exercise embedding
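The mapping from an interaction tuple to a single index can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes exercises are numbered from 0 to E-1 (the paper does not state the indexing convention), and the function names are hypothetical.

```python
def encode_interaction(exercise, correct, num_exercises):
    """Map (e_t, r_t) to the single index y_t = e_t + r_t * E.

    Assumption: exercise ids are 0-based, so y_t lies in [0, 2E).
    """
    if correct not in (0, 1):
        raise ValueError("correct must be 0 or 1")
    if not 0 <= exercise < num_exercises:
        raise ValueError("exercise id out of range")
    return exercise + correct * num_exercises


def decode_interaction(y, num_exercises):
    """Invert encode_interaction and recover (e_t, r_t)."""
    return y % num_exercises, y // num_exercises
```

With E = 10, a correct answer to exercise 3 is stored as index 13 and an incorrect one as index 3, so the two outcomes of the same exercise receive separate embeddings.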


We now describe the different layers of our architecture. 


Embedding layer: We transform the obtained input sequence
$y = (y_1, y_2, \ldots, y_t)$ into $s = (s_1, s_2, \ldots, s_n)$,
where $n$ is the maximum length that the model can handle. Since
the model works with inputs of a fixed length, if the sequence
length $t$ is less than $n$, we repeatedly add a padding
question-answer pair to the left of the sequence. However, if $t$
is greater than $n$, we partition the sequence $y$ into $t/n$
subsequences, each of length $n$. All these subsequences serve as
input to the model.
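The padding and partitioning step might look like the sketch below. Two points are assumptions rather than details given in the text: `pad` is a hypothetical reserved index for the padding question-answer pair, and a final chunk shorter than n is left-padded like any other short sequence.

```python
def to_fixed_length(seq, n, pad):
    """Split seq into chunks of length n; left-pad any short chunk with pad."""
    # An empty input still yields one all-padding sequence.
    chunks = [seq[i:i + n] for i in range(0, len(seq), n)] or [[]]
    return [[pad] * (n - len(chunk)) + list(chunk) for chunk in chunks]
```

For example, a sequence of 7 interactions with n = 3 yields two full subsequences and one left-padded subsequence holding the last interaction.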

We train an interaction embedding matrix, $\mathbf{M} \in \mathbb{R}^{2E \times d}$,
where $d$ is the latent dimension. This matrix is used to obtain an
embedding, $\mathbf{M}_{s_i}$, for each element $s_i$ in the sequence.
Similarly, we train an exercise embedding matrix, $\mathbf{E} \in \mathbb{R}^{E \times d}$,
such that each exercise $e_i$ is embedded in the $e_i$th row.

Position Encoding: Position encoding is the layer in the
self-attention neural network that encodes position, so that, as in
convolutional and recurrent neural networks, the model can take the
order of the sequence into account. This layer is particularly
important in the knowledge tracing problem because a student’s
knowledge state evolves gradually and steadily with time. The
knowledge state at a particular time instance should not show wavy
transitions [10]. To incorporate this, we use a parameter, the
position embedding $\mathbf{P} \in \mathbb{R}^{n \times d}$, which is
learned during training. The ith row of the position embedding
matrix, $\mathbf{P}_i$, is then added to the interaction embedding
vector of the ith element of the interaction sequence.
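The lookup-and-add step described above can be sketched with plain Python lists. The function name and the toy values are illustrative assumptions; in the actual model both matrices are trainable parameters.

```python
def embed_interactions(s, M, P):
    """Return rows M[s_i] + P[i], one per position i of the sequence s.

    s: list of interaction indices, M: interaction embedding matrix
    (2E x d), P: position embedding matrix (n x d), all as nested lists.
    """
    return [[m + p for m, p in zip(M[s_i], P[i])] for i, s_i in enumerate(s)]


# Toy example with 2 interaction indices, d = 2 and n = 2.
M = [[1.0, 2.0], [3.0, 4.0]]
P = [[0.5, 0.25], [0.75, 0.125]]
embedded = embed_interactions([1, 0], M, P)
```

Because the position row is added rather than concatenated, the same interaction index receives a different vector at each position while the dimensionality stays d.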


The output from the embedding layer is the embedded interaction
input matrix, $\hat{\mathbf{M}}$, and the embedded exercise matrix,
$\hat{\mathbf{E}}$:


$$\hat{\mathbf{M}} = \begin{bmatrix} \mathbf{M}_{s_1} + \mathbf{P}_1 \\ \mathbf{M}_{s_2} + \mathbf{P}_2 \\ \vdots \\ \mathbf{M}_{s_n} + \mathbf{P}_n \end{bmatrix}, \quad \hat{\mathbf{E}} = \begin{bmatrix} \mathbf{E}_{s_1} \\ \mathbf{E}_{s_2} \\ \vdots \\ \mathbf{E}_{s_n} \end{bmatrix} \qquad (1)$$


Self-attention layer: In our model, we use the scaled dot-product
attention mechanism [9]. This layer finds the relative weight
corresponding to each of the previously solved exercises for
predicting the correctness of the current exercise.


We obtain query and key-value pairs using the following
equations:


$$Q = \hat{\mathbf{E}} W^Q, \quad K = \hat{\mathbf{M}} W^K, \quad V = \hat{\mathbf{M}} W^V, \qquad (2)$$


where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are the query, key
and value projection matrices, respectively, which linearly project
the respective vectors to different spaces [9]. The relevance of
each of the previous interactions to the current exercise is
determined using the attention weights. For finding the attention
weights we use the scaled dot product [9], defined as:


$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V. \qquad (3)$$


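Multiple heads: In order to jointly attend to information from
different representative subspaces, we linearly project the
queries, keys and values $h$ times using different projection
matrices.


$$\text{Multihead}(\hat{\mathbf{M}}, \hat{\mathbf{E}}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O, \qquad (4)$$


where $\text{head}_i = \text{Attention}(\hat{\mathbf{E}} W_i^Q, \hat{\mathbf{M}} W_i^K, \hat{\mathbf{M}} W_i^V)$
and $W^O \in \mathbb{R}^{hd \times d}$.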

Causality:

In our model, we should consider only the first $t$ interactions
when predicting the result of the $(t+1)$st exercise. Therefore,
for a query $Q_i$, the keys $K_j$ such that $j > i$ should not be
considered. We use a causality layer to mask the weights learned
from a future interaction key.
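A single-head version of scaled dot-product attention with the causal mask can be sketched as follows. This is an illustrative pure-Python version, not the authors' TensorFlow implementation; `Q`, `K` and `V` are assumed to be already-projected matrices given as nested lists with one row per position.

```python
import math


def causal_attention(Q, K, V):
    """Scaled dot-product attention where query i attends only to keys j <= i."""
    d = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Scores for the allowed keys only; keys with j > i are masked out.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(K) - i - 1)
        weights.append(row)
        outputs.append([sum(row[j] * V[j][c] for j in range(len(K)))
                        for c in range(len(V[0]))])
    return outputs, weights
```

Each row of the returned weights sums to one and is zero for every future position, so the prediction for an exercise never depends on interactions that come after it.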


Feed Forward layer:

The self-attention layer described above results in a weighted sum
of the values $V_i$ of the previous interactions. However, the rows
of the matrix obtained from the multihead layer,
$S = \text{Multihead}(\hat{\mathbf{M}}, \hat{\mathbf{E}})$, are still
linear combinations of the values $V_i$ of the previous
interactions. To incorporate non-linearity in the model and
consider the interactions between different latent dimensions, we
use a feed forward network.


$$F = \text{FFN}(S) = \text{ReLU}(S W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}, \qquad (5)$$


where $W^{(1)} \in \mathbb{R}^{d \times d}$, $W^{(2)} \in \mathbb{R}^{d \times d}$,
$b^{(1)} \in \mathbb{R}^{d}$ and $b^{(2)} \in \mathbb{R}^{d}$ are
parameters learned during training.
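The position-wise feed forward computation, two affine maps with a ReLU in between applied to each row independently, can be sketched as follows. The helper names are hypothetical and the weights are passed in as nested lists purely for illustration.

```python
def affine(x, W, b):
    """Compute x W + b for a row vector x, a matrix W and a bias vector b."""
    return [sum(x[k] * W[k][j] for k in range(len(x))) + b[j]
            for j in range(len(b))]


def ffn(S, W1, b1, W2, b2):
    """Apply ReLU(S W1 + b1) W2 + b2 to every row of S independently."""
    result = []
    for row in S:
        hidden = [max(0.0, h) for h in affine(row, W1, b1)]  # ReLU
        result.append(affine(hidden, W2, b2))
    return result
```

With identity weights and zero biases the layer reduces to a ReLU, which makes it easy to check by hand that negative components are clipped to zero.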


Residual Connections: Residual connections [2] are used to
propagate the lower layer features to the higher layers. Hence, if
low layer features are important for prediction, the residual
connection helps in propagating them to the final layers where the
predictions are performed. In the context of KT, students attempt
exercises belonging to a specific concept to strengthen that
concept. Hence, a residual connection can help propagate the
embeddings of the recently solved exercises to the final layer,
making it easier for the model to leverage the low layer
information. A residual connection is applied after both the
self-attention and the feed forward layer.


Layer normalization: In [1], it was shown that normalizing inputs
across features can help in stabilizing and accelerating neural
networks. We use layer normalization in our architecture for the
same purpose. Layer normalization is also applied at both the
self-attention and the feed forward layer.


Prediction layer:

Finally, each row of the matrix $F$ obtained above is passed
through a fully connected network with Sigmoid activation to
predict the performance of the student.


$$p_i = \text{Sigmoid}(F_i w + b), \qquad (6)$$


where $p_i$ is a scalar representing the probability of the student
providing a correct response to exercise $e_i$, $F_i$ is the ith
row of $F$, and $\text{Sigmoid}(z) = 1/(1 + e^{-z})$.


Network Training: The objective of training is to minimize the
negative log likelihood of the observed sequence of student
responses under the model. The parameters are learned by
minimizing the cross entropy loss between $p_t$ and $r_t$.


$$\mathcal{L} = -\sum_t \left( r_t \log(p_t) + (1 - r_t) \log(1 - p_t) \right) \qquad (7)$$
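The prediction layer and the training loss can be sketched together as follows. This is a hedged illustration with hypothetical function names; the clipping constant `eps` is an added numerical safeguard and is not part of the formulation in the text.

```python
import math


def predict(F, w, b):
    """Return Sigmoid(F_i w + b) for every row F_i of F."""
    return [1.0 / (1.0 + math.exp(-(sum(f * wi for f, wi in zip(row, w)) + b)))
            for row in F]


def cross_entropy(p, r, eps=1e-12):
    """Summed binary cross entropy between predictions p and labels r."""
    loss = 0.0
    for pt, rt in zip(p, r):
        pt = min(max(pt, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        loss -= rt * math.log(pt) + (1 - rt) * math.log(1.0 - pt)
    return loss
```

A prediction of 0.5 costs log 2 regardless of the label, which is a convenient sanity check: an untrained model should score close to this value per response.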


3. EXPERIMENTAL SETTINGS 
3.1 Datasets 


To evaluate our model, we used four real-world datasets and 
one synthetic dataset. 


• Synthetic¹: This dataset is obtained by simulating 4000 virtual
  students’ answering trajectories. Each student answers the same
  sequence of 50 exercises, which are drawn from 5 virtual concepts
  with varying difficulty levels.


• ASSISTment 2009² (ASSIST2009): This dataset is provided by the
  ASSISTment online tutoring platform and is widely used for KT
  tasks. We conducted our experiments on the updated
  “skill-builder” dataset. The dataset is sparse, with a density of
  0.06, as shown in Table 2.


• ASSISTment 2015³ (ASSIST2015): ASSISTment 2015 contains students’
  responses on 100 skills. There are 19,917 students and 708,631
  interactions. Although the number of records in this dataset is
  larger than in ASSISTment 2009, the average number of records per
  student is smaller because the number of students is larger. This
  dataset is the sparsest of all the available datasets, with a
  density of 0.05.


¹ https://github.com/chrispiech/DeepKnowledgeTracing/tree/master/data/synthetic
² https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010
³ https://sites.google.com/site/assistmentsdata/home/2015-assistments-skill-builder-data
Table 2: Dataset Statistics


Datasets       #Users   #Skill tags   #Interactions   #Unique Interactions   Density
Synthetic-5    4000     50            200K            200K                   1
ASSIST2009     4417     124           328K            35K                    0.06
ASSIST2015     19917    100           709K            102K                   0.05
ASSIST-Chall   686      102           943K            57K                    0.81
STATICS        333      1223          190K            129K                   0.31


The columns #Users, #Skill tags and #Interactions represent the
number of students, the total number of exercise tags and the
number of records, respectively. The column Density represents the
density of each dataset (i.e., Density = #Unique Interactions /
(#Users × #Skill tags)).


• ASSISTment Challenge (ASSISTChall): This data is obtained from
  the ASSISTment 2017 competition⁴. It is the richest dataset in
  terms of the number of interactions, with 942,816 interactions,
  686 students and 102 skills. This dataset is the densest of all
  the available real-world datasets, with a density of 0.81.


• STATICS2011 (STATICS): This dataset contains the interactions
  from an engineering statics course, with 189,927 interactions,
  333 students and 1223 skill tags. We adopted the processed data
  from [11]. It is also a dense dataset, with a density of 0.31.


The complete statistical information for all the datasets can 
be found in Table 2. 
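The Density column of Table 2 can be recomputed from the other columns with the formula given in the table note. The sketch below uses the rounded counts printed in the table (for example 35K unique interactions), so the results agree with the reported densities only approximately.

```python
def density(unique_interactions, users, skill_tags):
    """Density = #Unique Interactions / (#Users x #Skill tags)."""
    return unique_interactions / (users * skill_tags)


# (unique interactions, users, skill tags), as printed in Table 2.
TABLE_2 = {
    "Synthetic-5": (200_000, 4000, 50),
    "ASSIST2009": (35_000, 4417, 124),
    "ASSIST2015": (102_000, 19917, 100),
    "ASSIST-Chall": (57_000, 686, 102),
    "STATICS": (129_000, 333, 1223),
}

densities = {name: density(*row) for name, row in TABLE_2.items()}
```

A density of 1 for Synthetic-5 reflects that every simulated student answers every exercise, whereas in ASSIST2009 and ASSIST2015 a typical student touches only a small fraction of the skill tags.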


3.2 Evaluation Methodology 


Metrics: The prediction task is considered in a binary clas- 
sification setting i.e., answering an exercise correctly or not. 
Hence, we compare the performance using the Area Under 
Curve (AUC) metric. 
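For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen correct response is scored higher than a randomly chosen incorrect one, with ties counted as one half. The sketch below is a simple quadratic-time illustration with a hypothetical function name, not the evaluation code used in the experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count a win when a positive outranks a negative; ties count one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of correct and incorrect responses.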

Approaches: We compare our model against the state-of- 
the-art KT methods, DKT [6], DKT+ [10], and DKVMN [11]. 
These methods are described in the introduction. 

Model Training and parameter selection: We trained the model with
80% of the dataset and tested it on the remaining 20%. For all the
methods, we tried the hidden state dimension
$d \in \{50, 100, 150, 200\}$. For the competing approaches, we
used the same hyperparameters as reported in their respective
papers. For initialization of weights and optimization, we used a
procedure similar to [10]. We implemented SAKT with TensorFlow and
used the ADAM [5] optimizer with a learning rate of 0.001. We used
a batch size of 256 for the ASSISTChall dataset and 128 for the
others. For datasets with a larger number of records, e.g.,
ASSISTChall and ASSIST2015, we used a dropout rate of 0.2, while
for the remaining datasets, we used a dropout rate of 0.2. We set
the maximum length of the sequence, $n$, roughly proportional to
the average number of exercise tags per student. For the
ASSISTChall and STATICS datasets we use $n = 500$; for ASSIST2009,
$n = 100$ and 50; for the Synthetic and ASSIST2015 datasets, $n$ is
set to 50.


⁴ https://sites.google.com/view/assistmentsdatamining


Table 3: Student Performance prediction comparison (AUC).


Datasets      DKT     DKT+    DKVMN   SAKT    Gain%
Synthetic     0.823   0.824   0.822   0.832   0.97
ASSIST2009    0.820   0.822   0.816   0.848   3.16
ASSIST2015    0.736   0.737   0.727   0.854   15.87
ASSISTChall   0.734   0.728   0.689   0.734   0.00
STATICS       0.815   0.835   0.814   0.853   2.16
Average       0.786   0.789   0.773   0.824   4.43


1 Bold numbers are the best performance.
2 The reported results are obtained by the best hyperparameter
  selection for each dataset individually.
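The Gain% column appears to be the relative improvement of SAKT over the best competing method on each dataset, and the reported 4.43% is the mean of the five per-dataset gains. The sketch below recomputes it from the rounded AUC values printed in Table 3, under that assumed definition, so small differences in the last digit are expected.

```python
AUC = {
    "Synthetic":   {"DKT": 0.823, "DKT+": 0.824, "DKVMN": 0.822, "SAKT": 0.832},
    "ASSIST2009":  {"DKT": 0.820, "DKT+": 0.822, "DKVMN": 0.816, "SAKT": 0.848},
    "ASSIST2015":  {"DKT": 0.736, "DKT+": 0.737, "DKVMN": 0.727, "SAKT": 0.854},
    "ASSISTChall": {"DKT": 0.734, "DKT+": 0.728, "DKVMN": 0.689, "SAKT": 0.734},
    "STATICS":     {"DKT": 0.815, "DKT+": 0.835, "DKVMN": 0.814, "SAKT": 0.853},
}


def gain_percent(row):
    """Relative gain of SAKT over the best baseline in one table row."""
    best_baseline = max(v for name, v in row.items() if name != "SAKT")
    return 100.0 * (row["SAKT"] - best_baseline) / best_baseline


gains = {name: gain_percent(row) for name, row in AUC.items()}
mean_gain = sum(gains.values()) / len(gains)
```

Note that the average is dominated by ASSIST2015: the other four datasets each contribute a gain of about 3% or less.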


4. RESULTS AND DISCUSSION 


Student Performance Prediction: Table 3 shows the 
performance comparison of SAKT with the current state- 
of-the-art methods. On the Synthetic dataset, SAKT per- 
forms better than the competing approaches, achieving an 
AUC of 0.832 compared to 0.824 by DKT+. Even though 
Synthetic is the most dense dataset, SAKT outperforms 
RNN based methods because of the methodology used for 
generating Synthetic. For this dataset, each individual exercise
is derived from only one concept. The probability of a student
answering an exercise from this dataset correctly is determined
using Item Response Theory [8] as
$p(\text{correct} \mid \alpha, \beta) = c + \frac{1 - c}{1 + e^{\beta - \alpha}}$,
where $c$ denotes the probability of guessing it correctly, and
$\alpha$ and $\beta$ are randomly chosen numbers that indicate the
concept ability and exercise difficulty, respectively. Thus, in
this dataset, the exercises belonging to the same concept are
strongly correlated. SAKT, unlike other benchmarks, directly
attempts to identify exercises belonging to the same concept and
hence performs better than other methods. On ASSIST2009, SAKT
performs better than the competing approaches, gaining a
performance improvement of 3.16% over the second best performing
method. For the ASSIST2015 dataset, SAKT shows an impressive
improvement of 15.87%. We attribute this gain to the fact that the
attention mechanism leveraged by SAKT can learn and generalize well
even when the dataset is sparse, which is the case with ASSIST2015,
as its density is the lowest among all the datasets. For
STATICS2011, our method achieves a performance improvement of 2.16%
compared to DKT+. For ASSISTChall, our method performs on par with
DKT. This can be attributed to the fact that ASSISTChall is the
densest of all the real-world datasets.
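The response model behind the Synthetic dataset can be illustrated with a short function. This sketch assumes the guess-adjusted logistic form $c + (1 - c)/(1 + e^{\beta - \alpha})$ and a hypothetical function name; it is not the generator used to produce the dataset.

```python
import math


def p_correct(alpha, beta, c):
    """Probability of a correct answer given concept ability alpha,
    exercise difficulty beta and guessing probability c."""
    return c + (1.0 - c) / (1.0 + math.exp(beta - alpha))
```

When ability equals difficulty, the probability sits halfway between the guessing rate and 1; it approaches c for very hard exercises and 1 for very easy ones.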

Attention weights visualization: Visualizing the attention
weights between the elements of past interactions (which serve as
keys) and the exercise that the student is going to solve next
(which serves as query) can help in understanding which exercises
in the past interactions are relevant to the query exercise. With
this motivation, we compute the sum of attention weights of each
exercise pair (e1, e2) across all the sequences where e1 serves as
query and the interaction with exercise e2 serves as key. We then
normalize the attention weights so that the sum of the weights for
each query is one. This results in a relevance matrix in which each
element (e1, e2) represents the influence of e2 on e1. We perform
our analysis on Synthetic because this dataset was generated with
known hidden concepts and hence the ground truth regarding the
relevance of different exercises is known to us. Figure 3a shows
the heatmap corresponding to the relevance matrix of exercises in
Synthetic. For Synthetic, all the sequences consist of all exercise
tags in the same order, from 1 to 50.


Table 4: Example of attention weights for some sequences in ASSIST2009 dataset.

Exercise tag        Past Interactions
Scale Factor        (Probability of Two Distinct Events, 1): 0.000001, (Circle Graph, 1): 0.0001, (Circle Graph, 1): 0.001, (Division Fractions, 0): 0.99
Ordering integers   (Intercepts, 0): 0.21, (Ordering positive decimals, 1): 0.611, (Multiplication whole numbers, 1): 0.09, (Proportion, 1): 0.033
Rate                (Interior Angles Figures, 0): 0.005, (Algebraic Simplification, 0): 0.009, (Rate, 0): 0.5, (Interior Angles Figures, 0): 0.1, (Algebraic Simplification, 0): 0.12

The column Exercise tag refers to the query (i.e., the exercise for
which we have to predict the student’s performance) and Past
Interactions refers to the sequence of interactions that has been
observed for that student. Red colored elements in the right column
represent the most important element among the past interaction
elements.


(b) Graph depicting the relevance between exercises. The relevance
is determined by the attention weights learned between the
exercises using SAKT. We observe a perfect clustering of latent
concepts.


Figure 3: Visualizing attention weight of Synthetic dataset.

In order to build the influence graph between the exercise tags,
as shown in Figure 3b, we use the relevance matrix. First, we draw
out the first exercise in the sequence that belongs to each hidden
concept. We then visit each row of the relevance matrix and connect
the exercise corresponding to that row to the two exercises with
the highest edge weight, which is proportional to the attention
weight between the pair of exercises. We can see that, based on the
attention weights, we are able to achieve a perfect clustering of
the exercise tags according to the hidden concepts from which they
are derived. An interesting observation is that two exercises which
occur far apart in the sequence but belong to the same concept can
be identified by SAKT. For example, as shown in Figure 3b, a query
on exercise 22 assigned the most weight to the key with exercise 5
even though they occur far apart in the sequence.
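The construction of the relevance matrix and the selection of the two strongest edges per exercise can be sketched as follows. The input format, a flat list of (query exercise, key exercise, attention weight) triples collected over all sequences, and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def relevance_matrix(records):
    """Sum attention weights per (query, key) pair and normalize per query."""
    totals = defaultdict(lambda: defaultdict(float))
    for query, key, weight in records:
        totals[query][key] += weight
    relevance = {}
    for query, row in totals.items():
        norm = sum(row.values())
        if norm > 0:  # skip queries that received no attention mass
            relevance[query] = {key: w / norm for key, w in row.items()}
    return relevance


def top_edges(relevance, k=2):
    """For each query exercise, keep the k keys with the highest relevance."""
    return {query: sorted(row, key=row.get, reverse=True)[:k]
            for query, row in relevance.items()}
```

Keeping only the top two edges per exercise yields a sparse graph, which is what makes the clusters of exercises sharing a hidden concept visible.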

Two exercises which are relevant to each other tend to have high
attention weights, as the performance on one of them impacts the
performance on the other. Additionally, in the real-world scenario,
the exercises which occur close together in the sequence tend to
belong to the same concept. Thus, we expect the attention weights
to be biased towards the exercises that occur recently in the
interaction sequence. To illustrate this, we manually analyzed the
ASSIST2009 dataset to visualize the attention weights for some
selected samples. Table 4 shows some of the exercises along with
the past interactions and the attention weights assigned to each
interaction.

Ablation Study: Table 5 shows the performance of the default SAKT
architecture and all the variants on all the datasets (with
$d = 200$).

No Positional Encoding (PE): In this variant of the default
architecture, we removed the positional encoding. As a result, the
attention weights assigned for predicting the performance of a
student on a particular exercise depend only on the interaction
embedding, without being affected by its position in the sequence.
ASSIST2009 and ASSIST2015 are sparse, and hence the impact of
removing PE is less pronounced for them than for the dense
datasets, ASSISTChall and STATICS.

No Residual Connection (RC): RCs show the importance of low level
features, i.e., the interaction embedding, while making the
prediction. Since our architecture is not very deep, the RCs do not
contribute much to the performance of the model. In fact, removal
of the residual connection gives better performance than the
default for the ASSIST2015 dataset.


Table 5: Ablation Study


Architecture   Synthetic   ASSIST2009   ASSIST2015   ASSISTChall   STATICS
Default        0.832       0.848        0.854        0.734         0.853
No PE          0.827       0.842        0.849        0.715         0.832
No RC          0.823       0.847        0.857        0.709         0.834
No Dropout     0.832       0.845        0.851        0.711         0.840
Single head    0.823       0.828        0.845        0.709         0.851
0 block        0.826       0.837        0.822        0.634         0.819
2 blocks       0.827       0.840        0.853        0.724         0.845

No Dropout: Dropout is used in neural networks to regularize the
model so that it can generalize better. Overfitting is more
pronounced for datasets with fewer records relative to the number
of model parameters. As a result, the role of dropout is more
effective for the ASSIST2009 and STATICS datasets.

Single head: Instead of using 5 heads as in the default
architecture, we tried a variant with only one head. Multiple heads
help in capturing the attention weights in different subspaces.
Using a single head consistently drops the performance of SAKT on
all the datasets.

No block: When no self-attention block is used, the prediction for
the next exercise depends only on the last interaction. Without the
attention block, the performance is significantly worse than that
of the default architecture.

2 Blocks: Increasing the number of self-attention blocks increases
the number of parameters of the model. However, in our case this
increase in parameters does not improve the performance. The reason
is that an important aspect of predicting a student’s performance
on an exercise is his/her performance on the past relevant
exercises; adding another self-attention block only makes the model
more complex.

Training efficiency: Figure 4 demonstrates the efficiency of
various methods based on their run times on GPU during the training
phase. Comparing the computational efficiency, SAKT spends only 1.4
seconds on one epoch, which is 46.42 times less than the time taken
by DKT+ (65 seconds/epoch), 32 times less than DKT (45
seconds/epoch) and 17.33 times less than DKVMN (26 seconds/epoch).
We conducted the experiments on a single NVIDIA Titan V GPU.

Figure 4: Training Efficiency on ASSIST2009 dataset.


5. CONCLUSION AND FUTURE WORK 


In this work, we proposed a self-attention based knowledge
tracing model, SAKT. It models a student’s interaction history
(without using any RNN) and predicts his/her performance on the
next exercise by considering the relevant exercises from his/her
past interactions. Extensive experimentation on a variety of
real-world datasets shows that our model can outperform the
state-of-the-art methods and is an order of magnitude faster than
the RNN-based approaches.